Goto

Collaborating Authors

 sparse data


Sparse clustering via the Deterministic Information Bottleneck algorithm

Costa, Efthymios, Papatsouma, Ioanna, Markos, Angelos

arXiv.org Machine Learning

Cluster analysis relates to the task of assigning objects into groups which ideally present some desirable characteristics. When a cluster structure is confined to a subset of the feature space, traditional clustering techniques face unprecedented challenges. We present an information-theoretic framework that overcomes the problems associated with sparse data, allowing for joint feature weighting and clustering. Our proposal constitutes a competitive alternative to existing clustering algorithms for sparse data, as demonstrated through simulations on synthetic data. The effectiveness of our method is established by an application on a real-world genomics data set.


Secure Sparse Matrix Multiplications and their Applications to Privacy-Preserving Machine Learning

Damie, Marc, Hahn, Florian, Peter, Andreas, Ramon, Jan

arXiv.org Artificial Intelligence

To preserve privacy, multi-party computation (MPC) enables executing Machine Learning (ML) algorithms on secret-shared or encrypted data. However, existing MPC frameworks are not optimized for sparse data. This makes them unsuitable for ML applications involving sparse data, e.g., recommender systems or genomics. Even in plaintext, such applications involve high-dimensional sparse data, that cannot be processed without sparsity-related optimizations due to prohibitively large memory requirements. Since matrix multiplication is central in ML algorithms, we propose MPC algorithms to multiply secret sparse matrices. On the one hand, our algorithms avoid the memory issues of the "dense" data representation of classic secure matrix multiplication algorithms. On the other hand, our algorithms can significantly reduce communication costs (some experiments show a factor 1000) for realistic problem sizes. We validate our algorithms in two ML applications in which existing protocols are impractical. An important question when developing MPC algorithms is what assumptions can be made. In our case, if the number of non-zeros in a row is a sensitive piece of information then a short runtime may reveal that the number of non-zeros is small. Existing approaches make relatively simple assumptions, e.g., that there is a universal upper bound to the number of non-zeros in a row. This often doesn't align with statistical reality, in a lot of sparse datasets the amount of data per instance satisfies a power law. We propose an approach which allows adopting a safe upper bound on the distribution of non-zeros in rows/columns of sparse matrices.


Quantum Noise Tomography with Physics-Informed Neural Networks

Sulc, Antonin

arXiv.org Artificial Intelligence

Characterizing the environmental interactions of quantum systems is a critical bottleneck in the development of robust quantum technologies. Traditional tomographic methods are often data-intensive and struggle with scalability. In this work, we introduce a novel framework for performing Lindblad tomography using Physics-Informed Neural Networks (PINNs). By embedding the Lindblad master equation directly into the neural network's loss function, our approach simultaneously learns the quantum state's evolution and infers the underlying dissipation parameters from sparse, time-series measurement data. Our results show that PINNs can reconstruct both the system dynamics and the functional form of unknown noise parameters, presenting a sample-efficient and scalable solution for quantum device characterization. Ultimately, our method produces a fully-differentiable digital twin of a noisy quantum system by learning its governing master equation.


Domain Adaptation and Multi-view Attention for Learnable Landmark Tracking with Sparse Data

Chase, Timothy Jr, Dantu, Karthik

arXiv.org Artificial Intelligence

The detection and tracking of celestial surface terrain features are crucial for autonomous spaceflight applications, including Terrain Relative Navigation (TRN), Entry, Descent, and Landing (EDL), hazard analysis, and scientific data collection. Traditional photoclinometry-based pipelines often rely on extensive a priori imaging and offline processing, constrained by the computational limitations of radiation-hardened systems. While historically effective, these approaches typically increase mission costs and duration, operate at low processing rates, and have limited generalization. Recently, learning-based computer vision has gained popularity to enhance spacecraft autonomy and overcome these limitations. While promising, emerging techniques frequently impose computational demands exceeding the capabilities of typical spacecraft hardware for real-time operation and are further challenged by the scarcity of labeled training data for diverse extraterrestrial environments. In this work, we present novel formulations for in-situ landmark tracking via detection and description. We utilize lightweight, computationally efficient neural network architectures designed for real-time execution on current-generation spacecraft flight processors. For landmark detection, we propose improved domain adaptation methods that enable the identification of celestial terrain features with distinct, cheaply acquired training data. Concurrently, for landmark description, we introduce a novel attention alignment formulation that learns robust feature representations that maintain correspondence despite significant landmark viewpoint variations. Together, these contributions form a unified system for landmark tracking that demonstrates superior performance compared to existing state-of-the-art techniques.


Reconstructing dynamics from sparse observations with no training on target system

Zhai, Zheng-Meng, Huang, Jun-Yin, Stern, Benjamin D., Lai, Ying-Cheng

arXiv.org Artificial Intelligence

In applications, an anticipated situation is where the system of interest has never been encountered before and sparse observations can be made only once. Can the dynamics be faithfully reconstructed from the limited observations without any training data? This problem defies any known traditional methods of nonlinear time-series analysis as well as existing machine-learning methods that typically require extensive data from the target system for training. We address this challenge by developing a hybrid transformer and reservoir-computing machine-learning scheme. The key idea is that, for a complex and nonlinear target system, the training of the transformer can be conducted not using any data from the target system, but with essentially unlimited synthetic data from known chaotic systems. The trained transformer is then tested with the sparse data from the target system. The output of the transformer is further fed into a reservoir computer for predicting the long-term dynamics or the attractor of the target system. The power of the proposed hybrid machine-learning framework is demonstrated using a large number of prototypical nonlinear dynamical systems, with high reconstruction accuracy even when the available data is only 20% of that required to faithfully represent the dynamical behavior of the underlying system. The framework provides a paradigm of reconstructing complex and nonlinear dynamics in the extreme situation where training data does not exist and the observations are random and sparse.


A Novel Framework for Analyzing Structural Transformation in Data-Constrained Economies Using Bayesian Modeling and Machine Learning

Katende, Ronald

arXiv.org Machine Learning

Structural transformation, the shift from agrarian economies to more diversified industrial and service-based systems, is a key driver of economic development. However, in low- and middle-income countries (LMICs), data scarcity and unreliability hinder accurate assessments of this process. This paper presents a novel statistical framework designed to address these challenges by integrating Bayesian hierarchical modeling, machine learning-based data imputation, and factor analysis. The framework is specifically tailored for conditions of data sparsity and is capable of providing robust insights into sectoral productivity and employment shifts across diverse economies. By utilizing Bayesian models, uncertainties in data are effectively managed, while machine learning techniques impute missing data points, ensuring the integrity of the analysis. Factor analysis reduces the dimensionality of complex datasets, distilling them into core economic structures. The proposed framework has been validated through extensive simulations, demonstrating its ability to predict structural changes even when up to 60\% of data is missing. This approach offers policymakers and researchers a valuable tool for making informed decisions in environments where data quality is limited, contributing to the broader understanding of economic development in LMICs.


Enhancing Startup Success Predictions in Venture Capital: A GraphRAG Augmented Multivariate Time Series Method

Gao, Zitian, Xiao, Yihao

arXiv.org Artificial Intelligence

In the Venture Capital(VC) industry, predicting the success of startups is challenging due to limited financial data and the need for subjective revenue forecasts. Previous methods based on time series analysis or deep learning often fall short as they fail to incorporate crucial inter-company relationships such as competition and collaboration. Regarding the issues, we propose a novel approach using GrahphRAG augmented time series model. With GraphRAG, time series predictive methods are enhanced by integrating these vital relationships into the analysis framework, allowing for a more dynamic understanding of the startup ecosystem in venture capital. Our experimental results demonstrate that our model significantly outperforms previous models in startup success predictions. To the best of our knowledge, our work is the first application work of GraphRAG.


APS-USCT: Ultrasound Computed Tomography on Sparse Data via AI-Physic Synergy

Sheng, Yi, Wang, Hanchen, Liu, Yipei, Yang, Junhuan, Jiang, Weiwen, Lin, Youzuo, Yang, Lei

arXiv.org Artificial Intelligence

Ultrasound computed tomography (USCT) is a promising technique that achieves superior medical imaging reconstruction resolution by fully leveraging waveform information, outperforming conventional ultrasound methods. Despite its advantages, high-quality USCT reconstruction relies on extensive data acquisition by a large number of transducers, leading to increased costs, computational demands, extended patient scanning times, and manufacturing complexities. To mitigate these issues, we propose a new USCT method called APS-USCT, which facilitates imaging with sparse data, substantially reducing dependence on high-cost dense data acquisition. Our APS-USCT method consists of two primary components: APS-wave and APS-FWI. The APS-wave component, an encoder-decoder system, preprocesses the waveform data, converting sparse data into dense waveforms to augment sample density prior to reconstruction. The APS-FWI component, utilizing the InversionNet, directly reconstructs the speed of sound (SOS) from the ultrasound waveform data. We further improve the model's performance by incorporating Squeeze-and-Excitation (SE) Blocks and source encoding techniques. Testing our method on a breast cancer dataset yielded promising results. It demonstrated outstanding performance with an average Structural Similarity Index (SSIM) of 0.8431. Notably, over 82% of samples achieved an SSIM above 0.8, with nearly 61% exceeding 0.85, highlighting the significant potential of our approach in improving USCT image reconstruction by efficiently utilizing sparse data.


The Sparse Tsetlin Machine: Sparse Representation with Active Literals

Østby, Sebastian, Brambo, Tobias M., Glimsdal, Sondre

arXiv.org Artificial Intelligence

This paper introduces the Sparse Tsetlin Machine (STM), a novel Tsetlin Machine (TM) that processes sparse data efficiently. Traditionally, the TM does not consider data characteristics such as sparsity, commonly seen in NLP applications and other bag-of-word-based representations. Consequently, a TM must initialize, store, and process a significant number of zero values, resulting in excessive memory usage and computational time. Previous attempts at creating a sparse TM have predominantly been unsuccessful, primarily due to their inability to identify which literals are sufficient for TM training. By introducing Active Literals (AL), the STM can focus exclusively on literals that actively contribute to the current data representation, significantly decreasing memory footprint and computational time while demonstrating competitive classification performance.


Sign Cauchy Projections and Chi-Square Kernel

Neural Information Processing Systems

In this paper, we propose to use only the signs of the projected data and we analyze the probability of collision (i.e., when the two signs differ). Interestingly, when α = 1 (i.e., Cauchy random projections), we show that the probability of collision can be accurately approximated as functions of the chi-square (χ